Agenda
- A brief review of non-Gaussian models.
- The Motivations for Mixed Effects Models
- How to fit Mixed Effects Models.
library(tidyverse)
library(broom)
library(ggplot2)
library(lme4)
library(multivariate2017)
Brief review of non-gaussian models
We’re going to briefly cover non-Gaussian models, in order to avoid the symptom of “when you have a hammer, everything looks like a nail.”
When not working with continuous data, the most common kind of data you’re likely to work with is count data of some kind. This count data can have two different qualities.
- Rate - How often did something happen at all, or per some unit of time.
- Proportion - How often did something happen out of all possible instances it could have happened.
These two kinds of data call for different generalized linear models.
- Rate - use a Poisson model.
- Proportion - use a Logistic model.
We’ll walk through these briefly using the example of filled pauses.
Filled Pauses
Sometimes when speaking, speakers experience so-called “disfluencies.” Among disfluencies, filled pauses have been treated as being a special sort. In American English Writing these are usually written out as <um> and <uh>, and in British English Writing they’re usually written out as <erm> and <er>. While the exact quality of the vowel can differ between varieties and dialects, we’re going to focus on just the alternation between whether it ends in an /m/ or us just an unstressed vowel.
The data here is drawn from naturalistic sociolinguistic interviews carried out in Philadelphia. You can access the data yourself if you run
library(devtools)
install_github("jofrhwld/UhUm")
library(UhUm)
Poisson regression
For every speaker, we’ll first model how often they used filled pauses at all. This will involve counting up all of their filled pauses, as well as how many total words they spoke. It’s necessary to count up how many total words were spoken, because not all interviews were the same length, meaning speakers had differential opportunity to use filled pauses. The duration of the interview is the exposure variable.
fp_count <- um_PNC %>%
mutate(age0 = (age-20)/10)%>%
group_by(idstring,
age0,
sex,
nwords) %>%
count()
We’ll look at the effect of age and sex on the overall rate of filled pauses like so:

fp_model <- glm(n ~ age0 * sex,
offset = log(nwords),
data = fp_count,
family = poisson)
summary(fp_model)
Call:
glm(formula = n ~ age0 * sex, family = poisson, data = fp_count,
offset = log(nwords))
Deviance Residuals:
Min 1Q Median 3Q Max
-21.109 -3.796 -1.105 2.506 25.009
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.250434 0.013831 -307.31 < 2e-16 ***
age0 0.050502 0.004088 12.36 < 2e-16 ***
sexm 0.166064 0.019841 8.37 < 2e-16 ***
age0:sexm 0.030427 0.005966 5.10 3.39e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 10828.6 on 394 degrees of freedom
Residual deviance: 9985.3 on 391 degrees of freedom
AIC: 12181
Number of Fisher Scoring iterations: 5
Without exactly understanding what the exact numbers mean, we can walk away with this understanding:
- the large negative intercept means that filled pauses are relatively rare (but we should compare this to other words to be sure)/
- The positive age effect means that as people get older, their rate of filled pauses increases.
- The positive sex effect means that men use filled pauses a bit more often than women.
- The positive interaction effect means that the age effect is a quite a bit stronger for men.
The specific estimates for the model are:
# Filled pauses per 1000 words at age0==0, sex==f
exp(-4.250434) * 1000
[1] 14.25804
# Filled pauses per 1000 words age0==2, sex==f
exp(-4.250434 + (2*0.050502)) * 1000
[1] 15.7734
# Filled pauses per 1000 words at age0==0, sex==m
exp(-4.250434 + 0.166064) * 1000
[1] 16.83374
# Filled pauses per 1000 words at age0==2, sex==m
exp(-4.250434 + 0.166064 + (2 * (0.050502 + 0.030427 ))) * 1000
[1] 19.79132
Logistic regression
If we’re interested not just in their overall rate of filled pause use, but rather when they used a filled pause, which one did they use, we would fit a logistic regression.
um_df <- um_PNC %>%
mutate(is_um = (word == "UM") * 1,
age0 = (age-20)/10)
um_proportion <- um_df %>%
group_by(idstring, age0, sex)%>%
summarise(um_prop = mean(is_um))

um_model <- glm(is_um~age0*sex,
data = um_df,
family = binomial)
summary(um_model)
Call:
glm(formula = is_um ~ age0 * sex, family = binomial, data = um_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4803 -0.7055 -0.4725 -0.1821 2.9013
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.19017 0.02852 6.669 2.58e-11 ***
age0 -0.41545 0.01018 -40.806 < 2e-16 ***
sexm -0.99217 0.04599 -21.571 < 2e-16 ***
age0:sexm -0.09844 0.01988 -4.953 7.32e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27657 on 25522 degrees of freedom
Residual deviance: 23366 on 25519 degrees of freedom
AIC: 23374
Number of Fisher Scoring iterations: 5
The rationale behind LMEs
The original reasoning behind moving to linear mixed effects modelling has a lot to do with sophisticated statistical reasoning which I’ll just briefly discuss. Let’s return to our ayT data. Recall, this is a collection of /ay/ pronunciations gathered from recorded interviews with 326 speakers from Philadelphia. Today, we’ll be looking at the effect of the vowels’ duration on the normalized F1.
To start off, we’ll be transforming the duration predictor. Specifically, I’ll be log2 transforming it, then subtracting the median. In mathematical notation that’s
\[
\text{dur_c} = \log_2(\text{duration})-median(\log_2(\text{duration}))
\]
The reason I’m doing this transformation is because duration values tend to have a long, rightward tail. Log transforming the duration helps make the distribution a bit more symmetrical.
A log transform of any base would do the job, actually (e.g. natural log, or log() or log10()). I chose log2 because of the effect it has on interpreting the slope. Every increase of 1 on the log2 scale corresponds to doubling the duration. So, instead of saying “the effect of log(duration) is X”, we can say “the effect of doubling duration is X”.
It’s important to note that I’m also only doing this because I don’t have a specific theory about why the effect of duration should be additive. Some information processing theories have specific predictions about additive effects of reaction times, and in those cases you maybe shouldn’t log RTs (Lo & Andrews 2015).
# filter() gets just the subset of the data that is pre-voiceless
# I log2 and center the duration in two steps inside mutate().
ayT <- ay %>%
filter(plt_vclass == "ay0") %>%
mutate(log2_dur = log2(dur),
dur_c = log2_dur - median(log2_dur),
dob = year-age,
decade = (dob-1900)/10)
ayT %>%
ggplot(aes(dur_c, F1_n))+
geom_point(alpha = 0.05)+
stat_smooth(method = 'lm', color = 'red')+
theme_bw()

You may not even be able to tell that the regression line has has a confidence interval because it is narrower than the width of the line! That’s because the model has low uncertainty about the slope:
summary(lm(F1_n ~ dur_c, data = ayT))
Call:
lm(formula = F1_n ~ dur_c, data = ayT)
Residuals:
Min 1Q Median 3Q Max
-2.4731 -0.3900 -0.0324 0.3498 5.2217
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.487737 0.004161 117.22 <2e-16 ***
dur_c 0.435710 0.007068 61.65 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.578 on 19340 degrees of freedom
Multiple R-squared: 0.1642, Adjusted R-squared: 0.1642
F-statistic: 3800 on 1 and 19340 DF, p-value: < 2.2e-16
But in an important sense, this model is overly confident. It thinks it’s working with 19,342 independent observations. As if 19,342 different people walked up to a microphone and said ay. But that is exactly not what happened here. Instead, a smaller number of speakers (326) had many tokens of ay, and not always the same number of tokens each.
To make the example more extreme, let’s imagine we had data from only two speakers, one who produced very many tokens, and who who produced relatively few.

Based on just these two speakers, it would be hard to say what the real effect of duration on F1 is. One speaker has a fairly positive slope, and the other has a fairly negative slope. However, if you fit a model with just these two speakers’ data in the model, the estimated slope would be be dominated by the speaker with a lot of data, and almost none given to the speaker with very little.
Call:
lm(formula = F1_n ~ dur_c, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.48608 -0.57665 -0.00132 0.63510 1.62271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1836 0.1532 7.726 1.61e-08 ***
dur_c 0.8775 0.2800 3.134 0.00393 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.85 on 29 degrees of freedom
Multiple R-squared: 0.253, Adjusted R-squared: 0.2272
F-statistic: 9.822 on 1 and 29 DF, p-value: 0.003926
However, this is not ideal, since we should give a bit more weight to the fact that at least one speaker has a weaker effect, even if they have less data. An even more perverse situation could arise if there’s a strong correlation between a the speaker and their intercept. For example, in these figures it looks like the relationship between duration and F1 is positive, but it’s conceptually possible that the relationship within every individual is actually negative if they were arranged like this figure:

This is also known as “Simpson’s paradox”
Technically Speaking…
This data set violates the assumption that the observations are independent and indentically distributed (i.i.d.). The data points are not all independent, because the pronunciations from within a speaker are going to be correlated. The data points are not all identically distributed because each speaker has their own characteristic distribution of data.
Thinking about Pooling
One way to start thinking about mixed effects models is to think about how you are “pooling” the data from your data sources. In this example, we’re talking about speakers, but this could also be study sites, lexical items, experimental subjects, stimulus items, etc.
Complete Pooling
What people have been doing for a long time is “complete pooling.” All of the data from all of the sources all just goes into one big data bucket. You then fit a model with all of this data in the bucket, undifferentiated by who contributed what data point.
For the reasons we’ve discussed already, this isn’t a great approach.
- It violates some statistical assumptions.
- The parameter estimates are over-confident.
- The parameter estimates are too skewed towards speakers/words with more data.
- It is possible for the structure of the data to give you exactly the wrong estimate.
No Pooling
One way you could try to resolve some of these problems would be to fit one model for every speaker. Here, you don’t pool the data at all. Each speaker just has their own bucket of data.
We’ll actually do this in a second, and then discuss the pros and cons.
Partial Pooling
The final option is to do “partial pooling”. You model with all of the data, but preserve information about which speaker contributed which data. This approach allows the “model” for each speaker to mutually inform each other.
This is a “mixed effects model”, which we’ll also look at.
Fitting no-pooling models
In this section, we’re going to fit some no-pooling models. It is not necessary to be able to do this to be able to fit mixed effects models, but I think this can help you understand what mixed effects models are doing.
While the no-pooling approach to modelling isn’t the most sophisticated (and you definitely can’t use their p-values), I think it is often a good idea to try it out as an exploratory technique. You can check out a talk by Hadley Wickham about it here.
It is pretty easy to fit one model for one speaker. We just take a subset of the data and fit the model:
speakers <- unique(ayT$idstring)
sp1 <- ayT %>%
filter(idstring == speakers[1])
sp1_mod <- lm(F1_n ~ dur_c, data = sp1)
summary(sp1_mod)
Call:
lm(formula = F1_n ~ dur_c, data = sp1)
Residuals:
Min 1Q Median 3Q Max
-1.06428 -0.32610 -0.01164 0.34649 1.15681
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.30902 0.05851 5.281 8.68e-07 ***
dur_c 0.39734 0.08841 4.494 2.05e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4442 on 91 degrees of freedom
Multiple R-squared: 0.1816, Adjusted R-squared: 0.1726
F-statistic: 20.2 on 1 and 91 DF, p-value: 2.05e-05
There is even a handy dandy function in the broom package called tidy() that will convert a linear model object (and many others) into a data frame.
tidy(sp1_mod)
But there are 326 speakers in this data set. We can’t go repeating this procedure that many times. I will be using tools from the tidyverse to fit one model for each speaker, and combine them all together. The exact details of how this works goes a bit beyond what we have time to cover here today, but if you are interested in how it works you can download the source code for this lecture from the Code tab at the top of the page.
The results of the no-pooling model is plotted below. For every speaker, I’ve plotted their estimated intercept along the x-axis and their estimated slope along the y-axis.

This no-pooling approach is informative, but also has some serious weaknesses. First of all, we can see that most speakers have the expected slope (on the y-axis) between duration and F1. That is, the slope is positive, meaning as duration increases, F1 increases (lowering the vowel). However, there is a lot if inter-speaker variability. Some speakers are estimated to have a fairly steep slope, and others have slopes closer to 0 or even negative.
Here’s a plot of the actual lines these parameters correspond to.

Discussion
What are the pros and cons of this no-pooling model?
Partial Pooling (Mixed Effects)
The biggest downside to the no pooling approach is that there is no shared information between the models. It’s possible that speakers who had a negative slope were only estimated to have a negative slope because they had very little data. If there were some way to incorporate information about the slopes of every other speaker into these speakers’ slope estimates, they might be considerably different.
This is exactly what Mixed Effects Models do. You define “grouping variables” or “pooling variables” or “random effects” in the model to tell the model to estimate effects for each speaker, but not to do so completely independently for each speaker.
~2 Minute Activity
Fit a complete pooling model with lm and a partial pooling model with lmer with the following code.
ayT_lm <-lm(F1_n ~ dur_c, data = ayT)
ayT_lme <- lmer(F1_n ~ dur_c + (1 + dur_c | idstring), data = ayT)
The big difference between the complete pooling and the partial pooling models are the definition of the “random effects” structures.
- Everything to the left of the
|
- Everything that can vary within the grouping variable.
1 refers to the intercept
- Everything to the right of the
|
With (1 + dur_c | idstring ), we’re saying
Speakers can have variable intercepts, and variable slopes.
Let’s look at the summary of the mixed effects model.
summary(ayT_lme)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ dur_c + (1 + dur_c | idstring)
Data: ayT
REML criterion at convergence: 27306
Scaled residuals:
Min 1Q Median 3Q Max
-5.6543 -0.6010 -0.0430 0.5386 10.3140
Random effects:
Groups Name Variance Std.Dev. Corr
idstring (Intercept) 0.10060 0.3172
dur_c 0.01769 0.1330 -0.12
Residual 0.22564 0.4750
Number of obs: 19342, groups: idstring, 326
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.58113 0.01827 31.81
dur_c 0.34284 0.01071 32.00
Correlation of Fixed Effects:
(Intr)
dur_c -0.092
The “Random Effects”
Strictly speaking, the mixed effects models aren’t estimating the actual speaker-level slopes and intercepts as proper parameters of the model. Instead, it’s estimating the covariance matrix of the slopes and the intercepts. In this summary it’s showing us
- The variance and standard deviation of individual level intercepts.
- The variance and standard deviation of individual level slopes.
- The correlation between the intercepts and slopes.
- The variance and standard deviation of the remaining residuals.
Here, it looks like speakers are more variable with respect to their baseline F1 than they are with respect to the effect duration has on their F1. And, the larger a speakers’ baseline F1 is, their effect of duration is somewhat weaker, but it’s not a very strong correlation.
Fixed Effects
These are the estimated “fixed effect”. They can be interpreted a lot like the coefficients of the models we’ve been fitting already, in the sense that these are part of the formula to draw a line. Another way is to interpret these as the average intercept and slope between speakers.
Looking at the random effects
It is possible to extract estimates of then random effects using the function ranef(). Strictly speaking, these are “BLUPs”, which is stands for “Best Linear Unbiased Predictors”. The point here is that these aren’t proper parameters of the model, but the distinction here is a bit arcane for an introductory course.
head(ranef(ayT_lme)$idstring)
These are the predicted differences in Intercept and slope for each speaker from the fixed effects. That is, if we added the (Intercept) value from the fixed effects to these random effects, we would get the predicted slope for each speaker. Let’s do that:
ayT_lme_fixef <- fixef(ayT_lme)
ranef(ayT_lme)$idstring %>%
ggplot(aes(`(Intercept)` + ayT_lme_fixef[1],
dur_c + ayT_lme_fixef[2]))+
geom_point()+
geom_hline(yintercept = 0, linetype = 2)+
geom_vline(xintercept = 0, linetype = 2)+
theme_minimal()+
coord_fixed()


We can get an idea of what the mixed-effects modelling has done by comparing these results to the no-pooling model. First, the clustering of individual speakers’ random effects is tighter. Also, no speaker is estimated to have an individual slope less than 0.
It is also possible to see the positive correlation between the random effects. Speakers who have a baseline lower ayT appear to be more affected by duration changes.
Multiple random effects
This is where it gets impossible to make illustrative graphs. It is also possible to include multiple grouping factors in a mixed effects model.
~5 minute break
We’re going to add word as a grouping factor. For now, we’ll treat word as only being able to modulate the baseline F1_n, which means we’ll add it with (1 | word) (a.k.a. a random intercept).
Go ahead and fit this model:
ayT_lme_multi <- lmer(F1_n ~ dur_c +
(1 + dur_c | idstring) +
(1 | word), data = ayT)
summary(ayT_lme_multi)
What happened?
However, it is possible to include a random slope by word too, so let’s do that as well:
ayT_lme_multi2 <- ayT_lme_multi <- lmer(F1_n ~ dur_c +
(1 + dur_c | idstring) +
(1 + dur_c| word), data = ayT)
summary(ayT_lme_multi2)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ dur_c + (1 + dur_c | idstring) + (1 + dur_c | word)
Data: ayT
REML criterion at convergence: 26612
Scaled residuals:
Min 1Q Median 3Q Max
-6.1022 -0.5875 -0.0406 0.5347 10.2437
Random effects:
Groups Name Variance Std.Dev. Corr
idstring (Intercept) 0.100144 0.31645
dur_c 0.018412 0.13569 -0.17
word (Intercept) 0.059942 0.24483
dur_c 0.005643 0.07512 -0.01
Residual 0.214192 0.46281
Number of obs: 19342, groups: idstring, 326; word, 255
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.66763 0.02868 23.28
dur_c 0.30469 0.01948 15.64
Correlation of Fixed Effects:
(Intr)
dur_c -0.176
One thing you might notice now is that the “fixed effect” of duration has gotten a bit smaller.
# complete pooling
coef(ayT_lm)
(Intercept) dur_c
0.4877373 0.4357102
# full ranef structure
fixef(ayT_lme_multi2)
(Intercept) dur_c
0.6676302 0.3046943
This may look scary, especially if the effect size gradually vanishes away, but this is exactly how mixed effects models are supposed to work. In some sense, in the complete-pooling model, we thought we were modelling the effect of duration, but actually some of that effect was actually due to the peculiarities of the words spoken, and who the speakers were.
When to add random intercepts/slopes?
Barr et al (2012) recommend (command?) fitting models with “maximal” random effects structures. That is, for every possible grouping factor, include a random slope for every grouping factor that’s logically possible. I think this is a good starting point, but isn’t always practical.
Logically Possible & Impossible Random Slopes
In the ayT model we just fit, it was possible for us to fit a random slope of duration by speaker and word, because speakers produced tokens with different duration, and words were produced with different durations.
However, if we wanted to include voicing again (pre-voiceless vs elsewhere), we could include a random slope of voicing by speaker, but not by word. This isn’t a prohibition, it’s just not logically possible and might blow up your model. That’s because each word only every has one voicing context. In the word, NICE, the following /s/ is voiceless, and always will be.
Similarly, if we wanted to include year of birth in the model, we could include a random slope of dob by word, because many words were spoken by people across many ages. But we couldn’t include a random slope of dob by speaker, because each speaker only has one date of birth.
Necessary Random Effects
Any time you are looking at a between-grouping factor variable, you should include the grouping factor with a random intercept. This is especially true for any kind of research done outside of an experimental lab context.
- age:
(1 | speaker)
- gender:
(1 | speaker)
- word frequency:
(1 | word)
- sentence grammaticality:
(1 | token)
Other necessary random effects would be related to the nature of the data. In this ah model, we haven’t explored any between-grouping factors, but all of these data points were produced by speakers, embedded in specific words, so it’s important to capture that in the random effects.
Recommended Random Effects
It is a good idea to include your biggest effects, or effects of greatest interest, as random slopes across the grouping factors which contain the most data.
The assumptions of LMEs
The way LMEs work is they make some strong assumptions abut the distribution of the random effects. Let’s look at the distribution of the parameters from the no-pooling model again (i.e. when we fit 1 model for each speaker).

These by-speaker parameters were estimated via maximum likelihood. That is, these are the estimated slopes and intercepts that best satisfy the linear regression formula.
The random intercepts and slopes are not estimated in this way. Rather, the model assumes that the random intercepts and slopes are drawn from a normal distribution. That is, the range of of possible random intercepts and slopes are constrained, so that they are consistent with being drawn from a normal distribution. This is what allows the estimates for each speaker to be “partially pooled.” Even if there is not very much data from a single speaker, we can still get some sense of what their individual slope ought to be if we we look at the mean and standard deviation of the slopes across all speakers.
This also leads to “shrinkage” of the estimated random effects. Compare the maximum likelihood estimates (in yellow) to the random effects (in grey) below. For the random slope especially, the range of values is much narrower.
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string

By and large, this shrinkage is a good thing. Let’s say that you ran a reaction time experiment, and then took your three subjects who pressed the button the fastest and the three subjects who pressed the button the slowest and re-tested them. Almost invariably, the three fastest subjects will be a bit slower on the second test, and the three slowest a bit faster. This is also known as “regression to the mean.” You could consider this shrinkage of the random effects to be “baking in” this phenomenon.
“Nuisance variables”
Another way to think about random effects is as “nuisance variables”. When modelling something as complex as why pronunciations change, there are always going to be factors missing from our models for a variety of reasons.
- there are things we know matter, but failed to measure them for some reason.
- there are things we know matter, but we don’t know how to measure them.
- there are things that may not be measureable.
- there are things we don’t realize matter.
If we don’t do our best to account for these “unmodellables”, they could very well mess with our fixed effects estimates. While they are not a perfect solution, random effects can go some way towards soaking up some of these unmodelable variables.
On the flip side, if you have something you know has an effect, include it in the model. For example, we know that gender has a big effect on many changes in progress. You should, therefore, include it in the model instead of hoping that the random effects will account for it. That’s because if there is a big between-subject variable that affects the outcome variable, the assumption that speaker random intercepts & slopes are drawn from a single normal distribution is violated.
---
title: "Introducing Mixed Effects Models"
output: 
  html_notebook: 
    code_folding: none
    css: ../custom.css
    theme: flatly
    toc: yes
    toc_float: yes
    toc_depth: 3
---

# Agenda

1. A *brief* review of non-Gaussian models.
2. The Motivations for Mixed Effects Models
3. How to fit Mixed Effects Models.


```{r}
library(tidyverse)
library(broom)
library(ggplot2)
library(lme4)
library(multivariate2017)

```

# Brief review of non-gaussian models

We're going to *briefly* cover non-Gaussian models, in order to avoid the symptom of "when you have a hammer, everything looks like a nail."

When not working with continuous data, the most common kind of data you're likely to work with is *count* data of some kind. This count data can have two different qualities.

- **Rate** - How often did something happen at all, or per some unit of time.
- **Proportion** - How often did something happen out of all possible instances it could have happened.

These two kinds of data call for different *generalized linear models*.

- **Rate**  - use a Poisson model.
- **Proportion** - use a Logistic model.

We'll walk through these briefly using the example of *filled pauses*.

<div class = "box break">
<span class = "big-label">Filled Pauses<span>

Sometimes when speaking, speakers experience so-called "disfluencies."
Among disfluencies, filled pauses have been treated as being a special sort. 
In American English Writing these are usually written out as <*um*> and <*uh*>,
and in British English Writing they're usually written out as <*erm*> and <*er*>.
While the exact quality of the vowel can differ between varieties and dialects, we're going to focus on just the alternation between whether it ends in an /m/ or us just an unstressed vowel.

</div>

The data here is drawn from naturalistic sociolinguistic interviews carried out in Philadelphia.
You can access the data yourself if you run

```{r}
library(devtools)
install_github("jofrhwld/UhUm")
```


```{r}
library(UhUm)
```

## Poisson regression

For every speaker, we'll first model how often they used filled pauses at all.
This will involve counting up all of their filled pauses, as well as how many total words they spoke.
It's necessary to count up how many total words were spoken, because not all interviews were the same length, meaning speakers had differential opportunity to use filled pauses.
The duration of the interview is the *exposure variable.*

```{r}
fp_count <- um_PNC %>%
              mutate(age0 = (age-20)/10)%>%
              group_by(idstring, 
                       age0, 
                       sex, 
                       nwords) %>%
              count()
```




We'll look at the effect of age and sex on the overall rate of filled pauses like so:

```{r}
ggplot(fp_count, aes(age0, (n/nwords)*100, color = sex))+
    geom_point()+
    scale_color_brewer(palette = "Dark2")+
    theme_bw()
```


```{r}
fp_model <- glm(n ~ age0 * sex, 
                offset = log(nwords), 
                data = fp_count,
                family = poisson)
```

```{r}
summary(fp_model)
```

Without exactly understanding what the exact numbers mean, we can walk away with this understanding:

- the large negative intercept means that filled pauses are relatively rare (but we should compare this to other words to be sure)/
- The positive age effect means that as people get older, their rate of filled pauses increases.
- The positive sex effect means that men use filled pauses a bit more often than women.
- The positive interaction effect means that the age effect is a quite a bit stronger for men.

The specific estimates for the model are:

```{r}
# Filled pauses per 1000 words at age0==0, sex==f
exp(-4.250434) * 1000

# Filled pauses per 1000 words age0==2, sex==f
exp(-4.250434 + (2*0.050502)) * 1000

# Filled pauses per 1000 words at age0==0, sex==m
exp(-4.250434 + 0.166064) * 1000

# Filled pauses per 1000 words at age0==2, sex==m
exp(-4.250434 + 0.166064 + (2 * (0.050502 + 0.030427 ))) * 1000
```


## Logistic regression

If we're interested not just in their overall rate of filled pause use, but rather when they used a filled pause, which one did they use, we would fit a logistic regression.

```{r}
um_df <- um_PNC %>%
            mutate(is_um = (word == "UM") * 1,
                   age0 = (age-20)/10)

um_proportion <- um_df %>%
                  group_by(idstring, age0, sex)%>%
                  summarise(um_prop = mean(is_um))
```

```{r}
ggplot(um_proportion, aes(age0, um_prop, color = sex))+
    geom_point()+
    scale_color_brewer(palette = "Dark2")+
    theme_bw()
```

```{r}
um_model <- glm(is_um~age0*sex, 
                data = um_df,
                family = binomial)
```


```{r}
summary(um_model)
```




# The rationale behind LMEs

The original reasoning behind moving to linear mixed effects modelling has a lot to do with sophisticated statistical reasoning which I'll just briefly discuss. Let's return to our `ayT` data.
Recall, this is a collection of /ay/ pronunciations gathered from recorded interviews with  `r length(unique(ay$idstring))` speakers from Philadelphia.
Today, we'll be looking at the effect of the vowels' *duration* on the normalized F1.

To start off, we'll be transforming the duration predictor. 
Specifically, I'll be log2 transforming it, then subtracting the median.
In mathematical notation that's 

$$
\text{dur_c} = \log_2(\text{duration})-median(\log_2(\text{duration}))
$$

The reason I'm doing this transformation is because duration values tend to have a long, rightward tail.
Log transforming the duration helps make the distribution a bit more symmetrical.

<div style = "width:100%;float:left;">
<div style = "width:45%;float:left;margin:auto;">
```{r echo = F, fig.width = 1, fig.height = 1}

ayT <- ay %>% filter(plt_vclass=="ay0")
ggplot(ayT, aes(dur))+
    stat_density(color = "black", alpha = 0.8)+
    theme_minimal()
```
</div>
<div style = "width:45%;float:left;margin:auto;">
```{r echo = F, fig.width = 1, fig.height = 1}
ggplot(ayT, aes(log2(dur)))+
    stat_density(color = "black", alpha = 0.8)+
    theme_minimal()
```
</div>
</div>

A log transform of any base would do the job, actually (e.g. natural log, or `log()` or `log10()`). 
I chose log2 because of the effect it has on interpreting the slope.
Every increase of 1 on the log2 scale corresponds to *doubling* the duration.
So, instead of saying "the effect of log(duration) is X", we can say  "the effect of doubling duration is X".

It's important to note that I'm also only doing this because I don't have a specific theory about why the effect of duration should be additive.
Some information processing theories have specific predictions about *additive* effects of reaction times, and in those cases you maybe shouldn't log RTs ([Lo & Andrews 2015](https://dx.doi.org/10.3389%2Ffpsyg.2015.01171)).


```{r}
# filter() gets just the subset of the data that is pre-voiceless
# I log2 and center the duration in two steps inside mutate().
ayT <- ay %>% 
          filter(plt_vclass == "ay0") %>%
          mutate(log2_dur = log2(dur),
                 dur_c = log2_dur - median(log2_dur),
                 dob = year-age,
                 decade = (dob-1900)/10)
```



```{r}
ayT %>% 
  ggplot(aes(dur_c, F1_n))+
    geom_point(alpha = 0.05)+
    stat_smooth(method = 'lm', color = 'red')+
    theme_bw()
```

You may not even be able to tell that the regression line has has a confidence interval because it is narrower than the width of the line! That's because the model has low uncertainty about the slope:

```{r}
summary(lm(F1_n ~ dur_c, data = ayT))
```
But in an important sense, this model is *overly* confident. It thinks it's working with `r formatC(nrow(ayT), big.mark = ",")` *independent* observations. As if `r formatC(nrow(ayT), big.mark = ",")` different people walked up to a microphone and said `ay`. But that is exactly not what happened here. Instead, a smaller number of speakers (`r length(unique(ayT$idstring))`) had many tokens of `ay`, and not always the same number of tokens each.

To make the example more extreme, let's imagine we had data from only two speakers, one who produced very many tokens, and who who produced relatively few.


```{r echo = F}
ayT %>%
  mutate(idstring = as.character(idstring))%>%
  filter(idstring %in% c( "PH84-2-6-", "PH85-2-2-"))%>%
  ggplot(aes(dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = 'lm')+
    facet_wrap(~idstring)+
    theme_bw()
```

Based on *just* these two speakers, it would be hard to say what  the real effect of duration on F1 is.
One speaker has a fairly positive slope, and the other has a fairly negative slope.
However, if you fit a model with just these two speakers' data in the model, the estimated slope would be be dominated by the speaker with a lot of data, and almost none given to the speaker with very little. 

```{r echo = F}
ayT %>%
  mutate(idstring = as.character(idstring))%>%
  filter(idstring %in% c( "PH84-2-6-", "PH85-2-2-"))%>%
  lm(F1_n ~ dur_c , data = .)%>%
  summary()
```

However, this is not ideal, since we should give a bit more weight to the fact that at least one speaker has a weaker effect, even if they have less data.
An even more perverse situation could arise if there's a strong correlation between a the speaker and their intercept. For example, in these figures it looks like the relationship between duration and F1 is *positive*, but it's conceptually possible that the relationship within every individual is actually *negative* if they were arranged like this figure:


```{r echo = F}
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y, 
            group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

confounded_dur <- confounded_data_frame(x = ayT$dur_c,
                                        y = ayT$F1_n,
                                        m = -0.5,
                                        num_grp = 10)

confounded_dur %>%
  filter(group != "grp10")%>%
  ggplot(aes(x, y, color = group))+
    geom_point(alpha = 0.05)+
    stat_smooth(method = 'lm', se = F)+
    xlab("dur_c")+
    ylab("F1_n")+
    scale_color_discrete(guide = "none")+
    theme_minimal()

```

This is also known as "Simpson's paradox"

## Technically Speaking...

This data set violates the assumption that the observations are *independent and indentically distributed* (i.i.d.). 
The data points are not all *independent*, because the pronunciations from within a speaker are going to be correlated.
The data points are not all *identically distributed* because each speaker has their own characteristic distribution of data.


# Thinking about Pooling

One way to start thinking about mixed effects models is to think about how you are "pooling" the data from your data sources. In this example, we're talking about speakers, but this could also be study sites, lexical items, experimental subjects, stimulus items, etc.


## Complete Pooling
What people have been doing for a long time is "complete pooling." All of the data from all of the sources all just goes into one big data bucket. You then fit a model with all of this data in the bucket, undifferentiated by who contributed what data point.

![](figures/complete_pooling.svg)

For the reasons we've discussed already, this isn't a great approach.

1. It violates some statistical assumptions.
2. The parameter estimates are over-confident.
3. The parameter estimates are too skewed towards speakers/words with more data.
4. It is possible for the structure of the data to give you exactly the wrong estimate.


## No Pooling

One way you could try to resolve some of these problems would be to fit one model for every speaker. Here, you don't pool the data at all. Each speaker just has their own bucket of data.

![](figures/no_pooling.svg)

We'll actually do this in a second, and then discuss the pros and cons.


## Partial Pooling

The final option is to do "partial pooling". You model with all of the data, but preserve information about which speaker contributed which data. This approach allows the "model" for each speaker to mutually inform each other. 


![](figures/partial_pooling.svg)

This is a "mixed effects model", which we'll also look at.

# Fitting no-pooling models

In this section, we're going to fit some no-pooling models. It is not necessary to be able to do this to be able to fit mixed effects models, but I think this can help you understand what mixed effects models are doing.

While the no-pooling approach to modelling isn't the most sophisticated (and you definitely can't use their p-values), I think it is often [a good idea to try it out as an exploratory technique](http://jofrhwld.github.io/blog/2016/05/01/many_models.html).  You can check out [a talk by Hadley Wickham about it here](https://www.youtube.com/watch?v=rz3_FDVt9eg). 


It is pretty easy to fit one model for one speaker.
We just take a subset of the data and fit the model:


```{r}
speakers <- unique(ayT$idstring)

sp1 <- ayT %>%
          filter(idstring == speakers[1])
sp1_mod <- lm(F1_n ~ dur_c, data = sp1)
summary(sp1_mod)
```

There is even a handy dandy function in the `broom` package called `tidy()` that will convert a linear model object (and many others) into a data frame.

```{r}
tidy(sp1_mod)
```

But there are `r length(unique(ayT$idstring))` speakers in this data set.
We can't go repeating this procedure that many times. 
I will be using tools from the tidyverse to fit one model for each speaker, and combine them all together. 
The exact details of how this works goes a bit beyond what we have time to cover here today, but if you are interested in how it works you can download the source code for this lecture from the `Code` tab at the top of the page. 

The results of the no-pooling model is plotted below.
For every speaker, I've plotted their estimated intercept along the x-axis and their estimated slope along the y-axis.

```{r echo = F}
# Welcome, travelers...
# see also my notes on Split-Apply-Combine http://jofrhwld.github.io/teaching/courses/2017_lsa/lectures/Session_3.nb.html
# and my longer discussion of this section of notes: http://jofrhwld.github.io/teaching/courses/2017_lsa/lectures/Session_7.nb.html#fitting_no-pooling_models


ayT %>% 
  group_by(idstring)%>%
  nest()%>%
  mutate(model = map(data, ~lm(F1_n ~ dur_c, data = .x)),
         params = map(model, tidy),
         n = map(data, nrow) %>% simplify())%>%
  unnest(params)%>%
  select(idstring, term, estimate, n)%>%
  spread(term, estimate) -> params

```

```{r echo = F}
params %>%
  filter(is.finite(dur_c))%>%
  ggplot(aes(`(Intercept)`, dur_c))+
    geom_hline(yintercept = 0, color = "grey50")+
    geom_vline(xintercept = 0, color = "grey50")+
    geom_point(aes(size = n), alpha = 0.6)+
    scale_size_area()+
    theme_minimal()+
    coord_fixed()
```


This no-pooling approach is informative, but also has some serious weaknesses.
First of all, we can see that *most* speakers have the expected slope (on the y-axis) between duration and F1. 
That is, the slope is positive, meaning as duration increases, F1 increases (lowering the vowel).
However, there is a lot if inter-speaker variability.
Some speakers are estimated to have a fairly steep slope, and others have slopes closer to 0 or even negative.

Here's a plot of the actual lines these parameters correspond to.

```{r}
params %>%
  filter(is.finite(dur_c))%>%
  ggplot()+
    geom_abline(aes(intercept = `(Intercept)`, 
                    slope = dur_c),
                alpha = 0.1)+
    xlim(-1, 2.85)+
    xlab("dur_c")+
    ylim(-2.14,5.9)+
    ylab("F1_n")+
    theme_minimal()
```


<div class = "box break">
<span class = "big-label">Discussion</span>

What are the pros and cons of this no-pooling model?
</div>


# Partial Pooling (Mixed Effects)

The biggest downside to the no pooling approach is that there is no shared information *between* the models. It's possible that speakers who had a negative slope were only estimated to have a negative slope because they had very little data.
If there were some way to incorporate information about the slopes of every other speaker into these speakers' slope estimates, they might be considerably different.

This is exactly what Mixed Effects Models do. You define "grouping variables" or "pooling variables" or "random effects" in the model to tell the model to estimate effects for each speaker, but not to do so completely independently for each speaker.

<div class = "box break">
<span class = 'big-label'>~2 Minute Activity</span>

Fit a complete pooling model with `lm` and a partial pooling model with `lmer` with the following code.

```{r}
ayT_lm <-lm(F1_n ~ dur_c, data = ayT)
ayT_lme <- lmer(F1_n ~ dur_c + (1 + dur_c | idstring), data = ayT)
```

</div>

The big difference between the complete pooling and the partial pooling models are the definition of the "random effects" structures. 

<div class = "illustrate">
(<span class = "pop">1</span> + <span class = "pop">dur_c</span> | <span class = "pop">idstring</span>)
</div>

- Everything to the left of the `|`
    - Everything that can vary within the grouping variable. 
    - `1` refers to the intercept
- Everything to the right of the `|`
    - The grouping variable.
    
With `(1 + dur_c | idstring )`, we're saying

> Speakers can have variable intercepts, and variable slopes.

Let's look at the summary of the mixed effects model.

```{r}
summary(ayT_lme)
```


## The "Random Effects"

<div class = 'half-img'>

![](figures/ranef.png)
</div>

Strictly speaking, the mixed effects models aren't estimating the actual speaker-level slopes and intercepts as proper parameters of the model.
Instead, it's estimating the *covariance matrix* of the slopes and the intercepts.
In this summary it's showing us

- The variance and standard deviation of individual level intercepts.
- The variance and standard deviation of individual level slopes.
- The correlation between the intercepts and slopes.
- The variance and standard deviation of the remaining residuals.

Here, it looks like speakers are more variable with respect to their baseline F1 than they are with respect to the effect duration has on their F1.
And, the *larger* a speakers' baseline F1 is, their effect of duration is somewhat weaker, but it's not a very strong correlation.

## Fixed Effects

<div class = 'half-img'>
![](figures/fixef.png)
</div>


These are the estimated "fixed effect". They can be interpreted a lot like the coefficients of the models we've been fitting already, in the sense that these are part of the formula to draw a line. Another way is to interpret these as the *average* intercept and slope between speakers.

## Looking at the random effects

It is possible to extract estimates of then random effects using the function `ranef()`. Strictly speaking, these are "BLUPs", which is stands for "Best Linear Unbiased Predictors". 
The point here is that these aren't *proper* parameters of the model, but the distinction here is a bit arcane for an introductory course.

```{r}
head(ranef(ayT_lme)$idstring)
```

These are the predicted differences in Intercept and slope for each speaker from the fixed effects. That is, if we added the `(Intercept)` value from the fixed effects to these random effects, we would get the predicted slope for each speaker. Let's do that:

```{r echp = F}
ayT_lme_fixef <- fixef(ayT_lme)

ranef(ayT_lme)$idstring %>%
  ggplot(aes(`(Intercept)` + ayT_lme_fixef[1], 
             dur_c + ayT_lme_fixef[2]))+
    geom_point()+
    geom_hline(yintercept = 0, linetype = 2)+
    geom_vline(xintercept = 0, linetype = 2)+  
    theme_minimal()+
    coord_fixed()
```


```{r echo = F}
ayT_lme_fixef <- fixef(ayT_lme)

ranef(ayT_lme)$idstring %>%
  ggplot()+
       geom_abline(aes(intercept = `(Intercept)` + ayT_lme_fixef[1], 
                    slope = dur_c + ayT_lme_fixef[2]),
                alpha = 0.1)+
    xlim(-1, 2.85)+
    xlab("dur_c")+
    ylim(-2.14,5.9)+
    ylab("F1_n")+
    theme_minimal()
```

We can get an idea of what the mixed-effects modelling has done by comparing these results to the no-pooling model. 
First, the clustering of individual speakers' random effects is tighter. 
Also, no speaker is estimated to have an individual slope less than 0. 

It is also possible to see the positive correlation between the random effects.
Speakers who have a baseline lower `ayT` appear to be more affected by duration changes.


## Multiple random effects

This is where it gets impossible to make illustrative graphs.
It is also possible to include multiple grouping factors in a mixed effects model.

<div class = 'break box'>
<span class = 'big-label'>~5 minute break</span>

We're going to add `word` as a grouping factor. For now, we'll treat `word` as only being able to modulate the baseline `F1_n`, which means we'll add it with `(1 | word)` (a.k.a. a random intercept).

Go ahead and fit this model:

```{r}
ayT_lme_multi <- lmer(F1_n ~ dur_c + 
                       (1 + dur_c | idstring) + 
                       (1 | word), data = ayT)
summary(ayT_lme_multi)
```

What happened?

</div>


However, it is *possible* to include a random slope by word too, so let's do that as well:

```{r}
ayT_lme_multi2 <- ayT_lme_multi <- lmer(F1_n ~ dur_c + 
                       (1 + dur_c | idstring) + 
                       (1 + dur_c| word), data = ayT)
summary(ayT_lme_multi2)
```

One thing you might notice now is that the "fixed effect" of duration has gotten a bit smaller. 


```{r}
# complete pooling
coef(ayT_lm)
```

```{r}
# full ranef structure
fixef(ayT_lme_multi2)
```




This may look scary, especially if the effect size gradually vanishes away, but this is exactly how mixed effects models are supposed to work. 
In some sense, in the complete-pooling model, we *thought* we were modelling the effect of duration, but actually some of that effect was actually due to the peculiarities of the words spoken, and who the speakers were.


## When to add random intercepts/slopes?

[Barr et al (2012)](http://idiom.ucsd.edu/~rlevy/papers/barr-etal-2013-jml.pdf) recommend (command?) fitting models  with "maximal" random effects structures.
That is, for every possible grouping factor, include a random slope for every grouping factor that's logically possible.
I think this is a good starting point, but isn't always practical.

### Logically Possible & Impossible Random Slopes

In the `ayT` model we just fit, it was possible for us to fit a random slope of duration by speaker and word, because speakers produced tokens with different duration, and words were produced with different durations.

However, if we wanted to include `voicing` again (pre-voiceless vs elsewhere), we could include a random slope of voicing by speaker, but *not* by word.
This isn't a prohibition, it's just not logically possible and might blow up your model.
That's because each word only every has one voicing context.
In the word, `NICE`, the following /s/ is voiceless, and always will be.

Similarly, if we wanted to include year of birth in the model, we could include a random slope of dob by word, because many words were spoken by people across many ages. But we *couldn't* include a random slope of dob by speaker, because each speaker only has one date of birth.


### Necessary Random Effects

Any time you are looking at a between-grouping factor variable, you should include the grouping factor with a random intercept. This is *especially* true for any kind of research done outside of an experimental lab context.

- age: `(1 | speaker)`
- gender: `(1 | speaker)`
- word frequency: `(1 | word)`
- sentence grammaticality: `(1 | token)`

Other necessary random effects would be related to the nature of the data. In this `ah` model, we haven't explored any between-grouping factors, but all of these data points were produced by speakers, embedded in specific words, so it's important to capture that in the random effects.

### Recommended Random Effects

It is a good idea to include your biggest effects, or effects of greatest interest, as random slopes across the grouping factors which contain the most data.


# The assumptions of LMEs

The way LMEs work is they make some strong assumptions abut the distribution of the random effects. Let's look at the distribution of the parameters from the no-pooling model again (i.e. when we fit 1 model for each speaker).

```{r echo = F}
params %>%
  gather(param, est, `(Intercept)`:dur_c)->param_long
param_long %>% 
  ggplot(aes(est))+
    stat_density(alpha = 0.8, color = "black")+
    facet_wrap(~param, scales = "free")+
    xlab("estimate")+
    theme_bw()
```

These by-speaker parameters were estimated via *maximum likelihood*.
That is, these are the estimated slopes and intercepts that best satisfy the linear regression formula.

The random intercepts and slopes are *not* estimated in this way.
Rather, the model assumes that the random intercepts and slopes are drawn from a *normal distribution*. 
That is, the range of of possible random intercepts and slopes are *constrained*, so that they are consistent with being drawn from a normal distribution.
This is what allows the estimates for each speaker to be "partially pooled."
Even if there is not very much data from a single speaker, we can still get some sense of what their individual slope ought to be if we we look at the mean and standard deviation of the slopes across all speakers.

This also leads to "shrinkage" of the estimated random effects.
Compare the maximum likelihood estimates (in yellow) to the random effects (in grey) below.
For the random slope especially, the range of values is much narrower.

```{r echo = F}
library(ggthemes)
ranef(ayT_lme)$idstring %>%
  mutate(`(Intercept)` = `(Intercept)`+ ayT_lme_fixef[1],
          dur_c = dur_c + ayT_lme_fixef[2],
         estimator = "lme")%>%
  gather(param, est, `(Intercept)`:dur_c)%>%
  bind_rows(param_long %>% mutate(estimator = "ml"))%>%
  ggplot(aes(est))+
    geom_density(aes(fill = estimator), 
                 position = "identity",
                 alpha = 0.6)+
    facet_wrap(~param, scales="free")+
    scale_fill_colorblind()+
    theme_bw()
```

By and large, this shrinkage is a good thing.
Let's say that you ran a reaction time experiment, and then took your three subjects who pressed the button the fastest and the three subjects who pressed the button the slowest and re-tested them.
Almost invariably, the three fastest subjects will be a bit slower on the second test, and the three slowest a bit faster.
This is also known as "regression to the mean." 
You could consider this shrinkage of the random effects to be "baking in" this phenomenon.

# "Nuisance variables"

Another way to think about random effects is as "nuisance variables".
When modelling something as complex as why pronunciations change, there are always going to be factors missing from our models for a variety of reasons.

- there are things we know matter, but failed to measure them for some reason.
- there are things we know matter, but we don't know *how* to measure them.
- there are things that may not be *measureable*.
- there are things we don't realize matter.

If we don't do our best to account for these "unmodellables", they could very well mess with our fixed effects estimates. 
While they are not a perfect solution, random effects can go some way towards soaking up some of these unmodelable variables.

On the flip side, if you have something you *know* has an effect, include it in the model. 
For example, we know that gender has a big effect on many changes in progress.
You should, therefore, include it in the model instead of hoping that the random effects will account for it.
That's because if there is a big between-subject variable that affects the outcome variable, the assumption that speaker random intercepts & slopes are drawn from a single normal distribution is violated.


<hr />




